An introduction to coding in R

Tom Keaney

Goals

  • Setup R to work as you intend
  • Code in a reproducible, clear style
  • Gain a familiarity with key data wrangling functions
  • Become well versed in creating figures with ggplot
  • Create a document that should help you in the future

Setting up

Make your interface look nice:

Projects

File > New project

  • Existing directory: places project in an existing folder
  • New directory: creates new folder
  • Version control: handy if you want to use github

Projects are powerful:

  • R knows where to look for files
  • No need to worry about setting working directories
  • Great for sharing

Quarto

  • Open a quarto document in your new project

  • File > New File > Quarto document

  • Save the document within the project directory (where you already are)

  • Save the _quarto.yml provided in the email within this directory

  • Render the document

Packages

The base version of R can be upgraded with packages

We shall use the tidyverse collection of packages.

#install.packages("tidyverse")
#install.packages("pander")
#install.packages("patchwork")
#install.packages("MetBrewer")
#install.packages("ggridges")

library(tidyverse) # for tidy coding
library(tinytable) # for nice tables
library(patchwork) # for aligning plots
library(MetBrewer) # for nice colours to use when making figures
library(ggridges) # for nice density plots

The tidyverse

  • A collection of packages

  • All follow the same logic

  • Quite different from base R

  • “Supremely readable”

Rhamphorhynchus muensteri

The dataset

pterosaur_data <-
  read_delim("pterosaur_data.csv", delim = ";")

pterosaur_data
1
Assign a name to the dataset
2
Load the csv file and specify the separator between columns
3
Display the dataset
# A tibble: 138 × 15
   Individual_ID ORBIT SKULL  NECK TRUNK_LENGTH  TAIL HUMERUS RADIUS
           <dbl> <dbl> <dbl> <dbl>        <dbl> <dbl>   <dbl>  <dbl>
 1             1  11      40  21.5         47.5  106.    16.5   26.7
 2             2  10      35  18           NA    112     15.3   26  
 3             3  NA      NA  NA           36     85     NA     NA  
 4             4  NA      36  NA           39    115     17     28  
 5             5  NA      NA  15.5         36.5  100     14.6   23.8
 6             6  12      35  23           40    110     NA     NA  
 7             7  10      35  20           41    106     15.5   24  
 8             8  NA      31  NA           NA     NA     14.5   24  
 9             9  NA      NA  17           36     79     NA     NA  
10            10  12.5    41  23           47    125     19     32  
# ℹ 128 more rows
# ℹ 7 more variables: METACARPAL_4 <dbl>, WING_PHALANX_1 <dbl>,
#   WING_PHALANX_2 <dbl>, WING_PHALANX_3 <dbl>, WING_PHALANX_4 <dbl>,
#   FEMUR <dbl>, TIBIA <dbl>

Rhamphorhynchus

Some intriguing patterns

The key to tidyverse coding: %>%

This weird symbol is called a pipe

  • You should read this as then

  • do this, then do this…

  • allows you to chain your code

pterosaur_data %>% view()

In common language this means: load the pterosaur data then view it in a new tab

The holy trinity

  • select(): order, rename or drop columns

  • filter(): keep or remove specific rows

  • mutate(): create new columns or edit existing ones

If you ever need help with a function, ? is your friend

# An example of how to get some help
?select 

select()

Removing columns you aren’t interested in:

pterosaur_data %>% select(Individual_ID, TAIL)

pterosaur_data %>% select(!c(Individual_ID, TAIL))

pterosaur_data %>% select(contains("WING"))

pterosaur_data %>% select(1, 5)
1
the ! reverses the statement
2
contains chooses columns with names that contain a pattern
3
Dangerous coding! Avoid.

Changing column names:

pterosaur_data %>% select(Specimen = Individual_ID)

# if you want to keep all other columns

pterosaur_data %>% select(Specimen = Individual_ID, everything())

# a recommended alternative

pterosaur_data %>% rename(Specimen = Individual_ID)

select() use cases

  1. Create a new dataset that only contains the ID of the individual and wing measurements for phalanxs 2, 3 and 4.

  2. Returning to the original data, remove the measurements for wing phalanx 2 and 4

  3. Why does this cause an error?

pterosaur_data %>% select(contains(WING))

filter()

Choosing rows of interest

Large_data <- pterosaur_data %>% filter(TAIL > 200)

ten_cm_tails <- pterosaur_data %>% filter(TAIL == 100)

long_tails_and_small_heads <- 
  pterosaur_data %>% filter(TAIL > 200 & SKULL < 90)

long_tails_or_small_heads <- 
  pterosaur_data %>% filter(TAIL > 200 | SKULL < 90)
1
== is needed to filter
2
| indicates or

Dealing with NA values

# remove NAs in single column
pterosaur_data %>% filter(!is.na(ORBIT))

# remove all rows with NAs
pterosaur_data %>% filter_at(vars(2:15), all_vars(!is.na(.))) 

filter() use cases

  1. Find pterosaurs that have longer necks than humerus’

  2. Returning to the original data, remove measurements with NA TRUNK_LENGTH values, for individuals with IDs greater than 50

  3. Trim the data to only include SKULL lengths between 60 and 90mm

  4. Find the individuals with the maximum and minimum tail lengths

  5. Find the individuals with tail lengths above the mean of the sampled population

filter(): handy operators

  • == = equal to
  • & = and
  • | = or
  • ! = does not
  • > = greater than
  • < = less than

mutate(): modifying existing columns

Let’s change the units of measurement to centimetres

pterosaur_data_cm <- pterosaur_data %>% mutate(ORBIT = ORBIT/10)

Now add 10cm to each orbit measurement (but don’t save this!)

pterosaur_data %>% mutate(ORBIT = ORBIT + 10)

mutate(): creating new columns

The total length of a wing is roughly the sum of the lengths of the humerus, radius, fourth metacarpal and the four wing phalanxs. With mutate(), we can calculate this and add it to the dataset:

pterosaur_data <-
  pterosaur_data %>% 
  mutate(single_wing_length = 
           HUMERUS + RADIUS + METACARPAL_4 + WING_PHALANX_1 + 
           WING_PHALANX_2 + WING_PHALANX_3 + WING_PHALANX_4) %>% 
  select(Individual_ID, single_wing_length, everything())

Conditional mutation

Can we place individuals into phenotypic classes?

Conditional mutation

pterosaur_data_classes <-
  pterosaur_data %>% 
  mutate(Size_class = case_when(
    single_wing_length < 300 ~ "Small",
    single_wing_length >= 300 ~ "Large",
    .default = "Unknown"))

pterosaur_data_classes %>% 
  select(Individual_ID, Size_class, single_wing_length)
1
For this subset of cases…
2
For a second subset of cases…
3
For all remaining cases…
# A tibble: 138 × 3
   Individual_ID Size_class single_wing_length
           <dbl> <chr>                   <dbl>
 1             1 Small                    183.
 2             2 Small                    174.
 3             3 Unknown                   NA 
 4             4 Small                    189.
 5             5 Small                    166.
 6             6 Unknown                   NA 
 7             7 Small                    164.
 8             8 Unknown                   NA 
 9             9 Unknown                   NA 
10            10 Small                    221 
# ℹ 128 more rows

Conditional mutation

It’s also possible to mutate a single row

pterosaur_data %>% 
  mutate(Sex = case_when(Individual_ID == 1 ~ "Special",
                         .default = "Ordinary"))

Build your phenotypic classes

  • Not every individual has a recorded wing length.

  • But there are other morphological traits in the dataset

  • Create a classification criteria and implement it

Bonus content: summarise()

  • The logic of mutate() can be extended to summarise() row values
  • Rows can be grouped to summarise conditionally using the group_by function
pterosaur_data_classes %>% 
  group_by(Size_class) %>% 
  summarise("Wing length" = mean(single_wing_length, na.rm = T))
1
mean() has a built-in way to deal with NA values
# A tibble: 3 × 2
  Size_class `Wing length`
  <chr>              <dbl>
1 Large               490.
2 Small               197.
3 Unknown             NaN 

Your task

  1. Split pterosaurs into phenotypic classes and remove those you can’t categorise
  2. Trim the dataframe to only include class, skull, length, wing length and tail length
  3. Summarise the data to show the mean for morphological traits, for each class
  4. Convert to cm and round to zero decimal places

Focus on writing clear code, with comments (using the #) accompanying each important step.

Hint: the round() function can be used inside mutate()

Table making

Once complete, pass your polished dataframe to this function with the %>% to make a neat table

# your dataframe goes here %>% 
 tt()
1
See ?tinytable::tt

Expanding our vocabulary

  • distinct()

  • slice() and slice_sample()

  • n()

  • bind_rows()

The tidyverse ‘for loop’

In the tidyverse, the accumulate() function replaces the for loop. Here’s how it works

  • Say we want to work out debt accumulated on a loan over a 6 month period

  • The monthly interest rate is 10%, or if you’re unlucky 20%

tibble(month = 1:6,
       debt_10 = accumulate(1:5, .init = 100, ~ .x*1.1),  
       debt_20 = accumulate(1:5, .init = 100, ~ .x*1.2)) 
  • ~ .x * 1.1 adds 10% to each element, starting from the initial value 100

accumulate() use cases

  1. Imagine now that instead of money, you need to track the number of pterosaurs in a population. That population starts off with 400 individuals and decreases by 5% each year. Track the population for 25 years.

  2. Now consider migration. Each year, 10 individuals enter the population from a neighbouring source population.

Visualising data

  • At its core, science communication is most effective through visual mediums

  • The ggplot2 package is included in the tidyverse

ggplot()

  • Build plots one layer at a time

  • Layers are added on top of one another

  • New layers are added with the + symbol

  • + == %>% in ggplot-land

Getting started

pterosaur_data_classes %>% 
  ggplot(aes())
  • ggplot() provides an empty canvas
  • aes determines how variables are mapped to visual aesthetics

Building a geom_histogram()

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) +
  geom_histogram(binwidth = 0.1) 

Fix the labels

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Skull length (cm)", y = "No. individuals", fill = "Size class")

Fix the theming

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Skull length (cm)", y = "No. individuals", fill = "Size class") +
  theme_classic() + # new
  theme(panel.grid.major = element_line(), # new
        text = element_text(size= 14)) # new

Fix the axis

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Skull length (cm)", y = "No. individuals", fill = "Size class") +
  scale_x_continuous(expand = c(0, 0), # new
                     breaks = c(0, 4.0, 8.0, 12.0, 16.0, 20.0), # new
                     limits = c(0, 20.0)) + # new
  scale_y_continuous(expand = c(0, 0)) + # new
  theme_classic() + 
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

Change the colours

Check out met.brewer

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Skull length (cm)", y = "No. individuals", fill = "Size class") +
  scale_x_continuous(expand = c(0, 0),
                     breaks = c(0, 4.0, 8.0, 12.0, 16.0, 20.0),
                     limits = c(0, 20.0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c(met.brewer("Monet")[2], met.brewer("Monet")[8])) + # new
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

Change to geom_density()

pterosaur_data_classes %>% 
  ggplot(aes(x = SKULL/10, fill = Size_class)) + 
  geom_density(colour = NA, alpha = 0.7) + # new
  labs(x = "Skull length (cm)", y = "No. individuals", fill = "Size class") +
  scale_fill_manual(values = c(met.brewer("Monet")[2], met.brewer("Monet")[8])) +
  scale_x_continuous(expand = c(0, 0),
                     breaks = c(0, 4.0, 8.0, 12.0, 16.0, 20.0),
                     limits = c(0, 20.0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

The scatterplot: geom_point()

pterosaur_data_classes %>%
  ggplot(aes(x = ORBIT, y = WING_PHALANX_1)) +
  geom_point()

Make improvements

pterosaur_data_classes %>%
  ggplot(aes(x = ORBIT/10, y = WING_PHALANX_1/10)) +
  geom_point() +
  labs(x = "Orbit length (cm)", y = "First wing phalanx\nlength (cm)") +
  scale_fill_manual(values = c(met.brewer("Monet")[2], met.brewer("Monet")[8])) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 5)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) +
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

geom_point()

pterosaur_data_classes %>% filter(Size_class != "Unknown") %>% 
  ggplot(aes(x = ORBIT/10, y = WING_PHALANX_1/10)) +
  geom_point(aes(fill = Size_class), shape = 21, size = 5, alpha = 0.8, 
             colour = "black") +
  labs(x = "Orbit length (cm)", y = "First wing phalanx\nlength (cm)", 
       fill = "Size class") +
  scale_fill_manual(values = c(met.brewer("Hiroshige")[4], 
                               met.brewer("Hiroshige")[6])) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 5)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) +
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

Continuous vs discrete colours

pterosaur_data_classes %>% filter(Size_class != "Unknown") %>% 
  ggplot(aes(x = ORBIT/10, y = WING_PHALANX_1/10)) +
  geom_point(aes(fill = ORBIT/10), shape = 21, size = 5, alpha = 0.9, 
             colour = "black") +
  labs(x = "Orbit length (cm)", y = "First wing phalanx\nlength (cm)", 
       fill = "Orbit\nlength") +
  scale_fill_gradientn(colors=met.brewer("Hiroshige", direction = -1)) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 5)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) +
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 14))

Line plots

Let’s return to our crippling interest rates

line_plot <-
  tibble(month = 0:20,
       `10%` = accumulate(1:20, .init = 5, ~ .x*1.1),  
       `20%` = accumulate(1:20, .init = 5, ~ .x*1.2)) %>% 
  pivot_longer(2:3, names_to = "Interest_rate", values_to = "Amount_owed") %>% 
  # plot
  ggplot(aes(x = month, y = Amount_owed, 
             group = Interest_rate, colour = Interest_rate)) +
    geom_line(linewidth = 2)

line_plot

Make it beautiful

line_plot +
  labs(x = "Months", y = "Dollars owed", 
       colour = "Interest rate") +
  scale_colour_manual(values = c(met.brewer("VanGogh3")[2], 
                               met.brewer("VanGogh3")[6])) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 20)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 210)) +
  theme_classic() +
  theme(panel.grid.major = element_line(),
        text = element_text(size= 28))

General tips

  • alpha: changes the transparency

  • fill: colours the inside of elements

  • colour: colours the outlines of elements

If you want these to change with your data, place them inside aes()

Many more plot styles can be found here

The power of Quarto

Tom’s supplementary material